Name: Ke Chen, Abhinav Bannerjee
UT EID: kc35827, ab45393
General Steps
- Acquire data source (ours was Top 2000 Companies as ranked by Forbes)
- Format data using ETL techniques for upload to Oracle SQL
- Analyze data and prepare visualizations in Tableau and RStudio (using ggplot2 and other packages)
- Create and publish Shiny web application with interactive visualizations
ETL Files
These are the two ETL files we used.
source("../01 Data/1_etl.R",print.eval=FALSE,echo = TRUE)
##
## > setwd("/Users/CK/DV_Finalproject/01 Data")
##
## > file_path <- "1.csv"
##
## > df <- read.csv(file_path)
##
## > names(df) <- gsub("\\.+", "_", names(df))
##
## > str(df)
## 'data.frame': 2008 obs. of 10 variables:
## $ Company : Factor w/ 1999 levels "3i Group","3M",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Country : Factor w/ 62 levels "Argentina","Australia",..: 59 60 27 26 18 54 60 60 52 59 ...
## $ State : Factor w/ 40 levels "Alabama","Arizona",..: 35 20 35 35 35 35 11 11 35 35 ...
## $ Industry : Factor w/ 83 levels "Advertising",..: 52 23 71 36 80 23 64 64 25 52 ...
## $ Rank : int 1573 205 1616 1895 1360 201 248 303 1349 1755 ...
## $ Year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ Sales : int 1200 31800 982 6700 1700 40100 21300 20000 9500 1900 ...
## $ Profits : int 1000 5000 178 84 420 2600 2300 1800 166 473 ...
## $ Assets : int 7300 31300 69200 14600 60000 45200 41300 27500 30600 8700 ...
## $ Market_Value: int 7000 105100 2100 3400 2600 50300 69900 91700 3200 9300 ...
##
## > measures <- c("Rank", "Year", "Sales", "Profits",
## + "Assets", "Market_Value")
##
## > dimensions <- setdiff(names(df), measures)
##
## > write.csv(format(df, scientific = NA), paste(gsub(".csv",
## + "", file_path), "..csv", sep = ""), row.names = FALSE, na = "")
##
## > tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]",
## + "", gsub(".csv", "", file_path)))
##
## > sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
##
## > if (length(measures) > 1 || !is.na(dimensions)) {
## + for (d in dimensions) {
## + sql <- paste(sql, paste(d, "varchar2(4000),\n"))
## + }
## + .... [TRUNCATED]
##
## > if (length(measures) > 1 || !is.na(measures)) {
## + for (m in measures) {
## + if (m != tail(measures, n = 1))
## + sql <- paste(sq .... [TRUNCATED]
##
## > sql <- paste(sql, ");")
##
## > cat(sql)
## CREATE TABLE 1 (
## -- Change table_name to the table name you want.
## Company varchar2(4000),
## Country varchar2(4000),
## State varchar2(4000),
## Industry varchar2(4000),
## Rank number(38,4),
## Year number(38,4),
## Sales number(38,4),
## Profits number(38,4),
## Assets number(38,4),
## Market_Value number(38,4)
## );
source("../01 Data/2_etl.R",print.eval=FALSE,echo = TRUE)
##
## > setwd("/Users/CK/DV_Finalproject/01 Data")
##
## > file_path <- "2.csv"
##
## > df <- read.csv(file_path)
##
## > names(df) <- gsub("\\.+", "_", names(df))
##
## > str(df)
## 'data.frame': 245 obs. of 2 variables:
## $ CountryName : Factor w/ 245 levels "Afghanistan",..: 199 201 77 163 2 146 182 220 242 1 ...
## $ ContinentName: Factor w/ 9 levels "Africa","Antarctica",..: 1 2 2 3 6 4 7 4 1 3 ...
##
## > measures <- c()
##
## > dimensions <- setdiff(names(df), measures)
##
## > write.csv(format(df, scientific = NA), paste(gsub(".csv",
## + "", file_path), "..csv", sep = ""), row.names = FALSE, na = "")
##
## > tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]",
## + "", gsub(".csv", "", file_path)))
##
## > sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
##
## > if (length(measures) > 1 || !is.na(dimensions)) {
## + for (d in dimensions) {
## + sql <- paste(sql, paste(d, "varchar2(4000),\n"))
## + }
## + .... [TRUNCATED]
##
## > sql <- paste(sql, ");")
##
## > cat(sql)
## CREATE TABLE 2 (
## -- Change table_name to the table name you want.
## CountryName varchar2(4000),
## ContinentName varchar2(4000),
## );
Dataframes
Here are the code for us to query dataset from SQL.
The include the summary of the data and also the head of 20 companies in the world.(Ascending in Rank.)
source("../01 Data/DATAFRAME1.R",echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > require("dplyr")
##
## > df1 <- data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas.edu:5001/rest/native/?query=\"select * from TOPCOMPANY \""),
## + httpheader = c(DB .... [TRUNCATED]
##
## > summary(df1)
## COMPANY COUNTRY
## Akamai Technologies : 2 United States :583
## Credito Valtellinese : 2 Japan :219
## Enagas : 2 China :180
## Erie Indemnity : 2 United Kingdom: 96
## Huntington Ingalls Industries: 2 South Korea : 66
## ICAP : 2 France : 61
## (Other) :1996 (Other) :803
## STATE INDUSTRY RANK
## Total :1425 Regional Banks : 272 Min. : 1.0
## California: 77 Oil & Gas Operations : 93 1st Qu.: 503.8
## New York : 73 Real Estate : 88 Median :1006.5
## Texas : 57 Electric Utilities : 87 Mean :1004.7
## Illinois : 37 Investment Services : 87 3rd Qu.:1508.2
## New Jersey: 23 Telecommunications services: 66 Max. :2000.0
## (Other) : 316 (Other) :1315
## YEAR SALES PROFITS ASSETS
## Min. :2015 Min. : -115 Min. :-7500.0 7200 : 15
## 1st Qu.:2015 1st Qu.: 4100 1st Qu.: 336.8 5200 : 13
## Median :2015 Median : 9300 Median : 637.5 14900 : 12
## Mean :2015 Mean : 19405 Mean : 1449.0 10900 : 11
## 3rd Qu.:2015 3rd Qu.: 18925 3rd Qu.: 1400.0 10300 : 10
## Max. :2015 Max. :485700 Max. :77400.0 10700 : 10
## (Other):1937
## MARKET_VALUE
## Min. : 73
## 1st Qu.: 6400
## Median : 12000
## Mean : 23936
## 3rd Qu.: 24625
## Max. :741800
##
##
## > df1 <- df1 %>% arrange(RANK)
##
## > head(df1, 20)
## COMPANY COUNTRY STATE
## 1 ICBC China Total
## 2 China Construction Bank China Total
## 3 Agricultural Bank of China China Total
## 4 Bank of China China Total
## 5 Berkshire Hathaway United States Nebraska
## 6 JPMorgan Chase United States New York
## 7 Exxon Mobil United States Texas
## 8 PetroChina China Total
## 9 General Electric United States Connecticut
## 10 Wells Fargo United States California
## 11 Toyota Motor Japan Total
## 12 Apple United States California
## 13 Royal Dutch Shell Netherlands Total
## 14 Volkswagen Group Germany Total
## 15 HSBC Holdings United Kingdom Total
## 16 Chevron United States California
## 17 Wal-Mart Stores United States Arkansas
## 18 Samsung Electronics South Korea Total
## 19 Citigroup United States New York
## 20 China Mobile China Total
## INDUSTRY RANK YEAR SALES PROFITS ASSETS
## 1 Major Banks 1 2015 166800 44800 3322000
## 2 Regional Banks 2 2015 130500 37000 2698900
## 3 Regional Banks 3 2015 129200 29100 2574800
## 4 Major Banks 4 2015 120300 27500 2458300
## 5 Investment Services 5 2015 194700 19900 534600
## 6 Major Banks 6 2015 97800 21200 2593600
## 7 Oil & Gas Operations 7 2015 376200 32500 349500
## 8 Oil & Gas Operations 8 2015 333400 17400 387700
## 9 Conglomerates 9 2015 148500 15200 648300
## 10 Major Banks 10 2015 90400 23100 1701400
## 11 Auto & Truck Manufacturers 11 2015 252200 19100 389700
## 12 Computer Hardware 12 2015 199400 44500 261900
## 13 Oil & Gas Operations 13 2015 420400 14900 353100
## 14 Auto & Truck Manufacturers 14 2015 268500 14400 425000
## 15 Major Banks 15 2015 81100 13500 2634100
## 16 Oil & Gas Operations 16 2015 191800 19200 266000
## 17 Discount Stores 16 2015 485700 16400 203700
## 18 Semiconductors 18 2015 195900 21900 209600
## 19 Major Banks 19 2015 93900 7200 1846000
## 20 Telecommunications services 20 2015 104100 17700 209000
## MARKET_VALUE
## 1 278300
## 2 212900
## 3 189900
## 4 199100
## 5 354800
## 6 225500
## 7 357100
## 8 334600
## 9 253500
## 10 278300
## 11 239000
## 12 741800
## 13 195400
## 14 126000
## 15 167700
## 16 201000
## 17 261300
## 18 199400
## 19 156700
## 20 271500
As we can see in above. United States Japan, and China has the most of companies in the ranking list.And the profits’ range start from 7.5 Billion defecit to 77.4 B Surplus.
Also,in the top 20 Companies, There are 9 financial related Companies, which is almost 50 % in the top 20. And These 9 comapnies include 6 major banks, 2 regional banks and 1 investment Services. Moreover, There are 4 Oil $ Gas companies, which is about 20 % in the top 20.
source("../01 Data/DATAFRAME2.R",echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > df2 <- data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas.edu:5001/rest/native/?query=\"select * from CONTINENT\""),
## + httpheader = c(DB = .... [TRUNCATED]
Shiny
The following two files are the code for Shiny.
source("../04 Shiny/ui.R",echo = TRUE)
##
## > library(shiny)
##
## > navbarPage(title = "Final Project", tabPanel(title = "Scatter Plot",
## + fluidRow(column(5), column(7, actionButton(inputId = "scatter",
## + .... [TRUNCATED]
## <nav class="navbar navbar-default navbar-static-top" role="navigation">
## <div class="container">
## <div class="navbar-header">
## <span class="navbar-brand">Final Project</span>
## </div>
## <ul class="nav navbar-nav">
## <li class="active">
## <a href="#tab-2600-1" data-toggle="tab" data-value="Scatter Plot">Scatter Plot</a>
## </li>
## <li>
## <a href="#tab-2600-2" data-toggle="tab" data-value="Crosstab">Crosstab</a>
## </li>
## <li>
## <a href="#tab-2600-3" data-toggle="tab" data-value="Bar Chart">Bar Chart</a>
## </li>
## <li>
## <a href="#tab-2600-4" data-toggle="tab" data-value="Histogram">Histogram</a>
## </li>
## <li>
## <a href="#tab-2600-5" data-toggle="tab" data-value="Pie Chart">Pie Chart</a>
## </li>
## </ul>
## </div>
## </nav>
## <div class="container-fluid">
## <div class="tab-content">
## <div class="tab-pane active" data-value="Scatter Plot" id="tab-2600-1">
## <div class="row">
## <div class="col-sm-5"></div>
## <div class="col-sm-7">
## <button id="scatter" type="button" class="btn btn-default action-button">Generate Scatter Plot</button>
## </div>
## </div>
## <div class="row">
## <div class="col-sm-1 col-sm-offset-1">
## <div class="col-sm-8">
## <div id="scatterplot" class="shiny-plot-output" style="width: 100% ; height: 400px"></div>
## </div>
## </div>
## </div>
## </div>
## <div class="tab-pane" data-value="Crosstab" id="tab-2600-2">
## <div class="row">
## <div class="col-sm-2"></div>
## <div class="col-sm-5">
## <div class="form-group shiny-input-container">
## <label class="control-label" for="kpi1">Count Rank(low fence)</label>
## <input class="js-range-slider" id="kpi1" data-min="1" data-max="158" data-from="5" data-step="1" data-grid="true" data-grid-num="9.8125" data-grid-snap="false" data-prettify-separator="," data-keyboard="true" data-keyboard-step="0.636942675159236" data-drag-interval="true" data-data-type="number"/>
## </div>
## </div>
## <div class="col-sm-5">
## <div class="form-group shiny-input-container">
## <label class="control-label" for="kpi2">Count Rank (medium fence)</label>
## <input class="js-range-slider" id="kpi2" data-min="1" data-max="158" data-from="20" data-step="1" data-grid="true" data-grid-num="9.8125" data-grid-snap="false" data-prettify-separator="," data-keyboard="true" data-keyboard-step="0.636942675159236" data-drag-interval="true" data-data-type="number"/>
## </div>
## </div>
## </div>
## <div class="col-sm-8">
## <div id="crosstab" class="shiny-plot-output" style="width: 100% ; height: 400px"></div>
## </div>
## </div>
## <div class="tab-pane" data-value="Bar Chart" id="tab-2600-3">
## <div class="row">
## <div class="col-sm-5"></div>
## <div class="col-sm-7">
## <button id="bar" type="button" class="btn btn-default action-button">Generate Bar Chart</button>
## </div>
## </div>
## <div class="col-sm-8">
## <div id="barchart" class="shiny-plot-output" style="width: 100% ; height: 400px"></div>
## </div>
## </div>
## <div class="tab-pane" data-value="Histogram" id="tab-2600-4">
## <div class="row">
## <div class="col-sm-3"></div>
## <div class="col-sm-5">
## <div class="form-group shiny-input-container">
## <label class="control-label" for="hist1">Binsize</label>
## <input class="js-range-slider" id="hist1" data-min="0" data-max="5000" data-from="1000" data-step="1" data-grid="true" data-grid-num="10" data-grid-snap="false" data-prettify-separator="," data-keyboard="true" data-keyboard-step="0.02" data-drag-interval="true" data-data-type="number"/>
## </div>
## </div>
## <div class="col-sm-8">
## <div id="Histogram" class="shiny-plot-output" style="width: 100% ; height: 400px"></div>
## </div>
## </div>
## </div>
## <div class="tab-pane" data-value="Pie Chart" id="tab-2600-5">
## <div class="row">
## <div class="col-sm-5"></div>
## <div class="col-sm-7">
## <div id="radio" class="form-group shiny-input-radiogroup shiny-input-container">
## <label class="control-label" for="radio">
## <h3>Slect the Measure</h3>
## </label>
## <div class="shiny-options-group">
## <div class="radio">
## <label>
## <input type="radio" name="radio" value="1" checked="checked"/>
## <span>Count of Rank</span>
## </label>
## </div>
## <div class="radio">
## <label>
## <input type="radio" name="radio" value="2"/>
## <span>Market Value</span>
## </label>
## </div>
## </div>
## </div>
## </div>
## </div>
## <div class="row">
## <div class="col-sm-5"></div>
## <div class="col-sm-7">
## <button id="piechar" type="button" class="btn btn-default action-button">Generate Pie Chart</button>
## </div>
## </div>
## <div class="col-sm-8">
## <div id="piechart" class="shiny-plot-output" style="width: 100% ; height: 400px"></div>
## </div>
## </div>
## </div>
## </div>
source("../04 Shiny/server.R",echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > require(ggplot2)
##
## > require(dplyr)
##
## > require(shiny)
##
## > require(shinydashboard)
##
## > require(leaflet)
##
## > shinyServer(function(input, output) {
## + df1 <- eventReactive(input$scatter, {
## + data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas. ..." ... [TRUNCATED]
All shiny graph codes are included here. We use grammar of graph and Data wrangling to create these graphs.
Visualization
We started our project by following the steps in Tableau Methodology. We tried to use different plots to discover interesting things. We generated a Barchart, Crosstab, Histogram,Scatter Plot,Box Plot,Map,Pie Chart and Bubble to find some interesting facts.
Barchart allows us to see which industries take up most of the ranks
Crosstab displays number of ranked companies in every continent based on industry
This is a breakdown of % total profits and % total average profits for each industry
The boxplot visualizes the highest profits based on industry in each continent The following boxplot is the plot after we exclude some extreme outliers.(Computer Hardware in North America,Major Bank in Asia, NA, Australia and Central America).
Pie Chart shows market value of industries
This chart relates industry and market value by size
This map shows a state-by-state breakdown of notable ranked companies and their respective industry